import pandas as pd
Text Manipulation Methods in pandas
This notebook explores pandas string methods (accessed via .str) for manipulating text data in DataFrames. It covers case conversion, substring search, regular expressions, replacement, and splitting.
Introduction
pandas provides vectorized string operations through the .str accessor. These methods apply element-wise to a Series (or Index) of strings, so an entire column can be transformed in a single call without an explicit loop.
Sample Data
We’ll use a simple DataFrame with text data to demonstrate string methods.
data = {
'TextData': ['Hello','World','Python', 'Pandas', 'Data Science']
}
df = pd.DataFrame(data)
df
| | TextData |
|---|---|
| 0 | Hello |
| 1 | World |
| 2 | Python |
| 3 | Pandas |
| 4 | Data Science |
Case Conversion
Convert text to lowercase or uppercase using .str.lower() and .str.upper().
df['LowerCase'] = df['TextData'].str.lower()
df
| | TextData | LowerCase |
|---|---|---|
| 0 | Hello | hello |
| 1 | World | world |
| 2 | Python | python |
| 3 | Pandas | pandas |
| 4 | Data Science | data science |
df['UpperCase'] = df['TextData'].str.upper()
df
| | TextData | LowerCase | UpperCase |
|---|---|---|---|
| 0 | Hello | hello | HELLO |
| 1 | World | world | WORLD |
| 2 | Python | python | PYTHON |
| 3 | Pandas | pandas | PANDAS |
| 4 | Data Science | data science | DATA SCIENCE |
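Beyond lower() and upper(), the .str accessor offers a few other case helpers. The sketch below is a small illustration on the same sample words; the variable names are chosen here for the example and are not part of the notebook's DataFrame.

```python
import pandas as pd

text = pd.Series(['Hello', 'World', 'Python', 'Pandas', 'Data Science'])

# title() capitalizes the first letter of every word
title_case = text.str.lower().str.title()

# capitalize() uppercases only the first character of the whole string
capitalized = text.str.capitalize()

# swapcase() inverts the case of each character
swapped = text.str.swapcase()

print(capitalized.iloc[4])  # Data science
print(swapped.iloc[0])      # hELLO
```

Lowercasing both sides before comparing is a common way to make equality checks case-insensitive.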
Searching in Text
Check if strings contain substrings with .str.contains(). Use case=False for case-insensitive search.
df['Contains'] = df['TextData'].str.contains('O', case=False)
df
| | TextData | LowerCase | UpperCase | Contains |
|---|---|---|---|---|
| 0 | Hello | hello | HELLO | True |
| 1 | World | world | WORLD | True |
| 2 | Python | python | PYTHON | True |
| 3 | Pandas | pandas | PANDAS | False |
| 4 | Data Science | data science | DATA SCIENCE | False |
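Two other parameters of .str.contains() are often useful in practice: na, which sets the result for missing entries, and regex, which controls whether the pattern is treated as a regular expression. The following is a minimal sketch on made-up data (the 'C++ (beta)' entry and the missing value are assumptions added for illustration).

```python
import pandas as pd

# Hypothetical data with one missing entry and one string containing regex metacharacters
text = pd.Series(['Hello', 'World', None, 'C++ (beta)'])

# na=False makes missing entries count as "no match", so the result
# is a plain boolean mask that can be used directly for filtering
mask = text.str.contains('o', na=False)
filtered = text[mask]

# regex=False treats the pattern as a literal substring,
# so '+' does not need to be escaped
literal = text.str.contains('++', regex=False, na=False)

print(filtered.tolist())  # ['Hello', 'World']
```

Depending on the pandas version and the column's dtype, omitting na may leave a missing value in the result, which cannot be used as a boolean mask; passing na=False avoids that.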
Regular Expressions (Regex)
Use regex with methods like .str.findall() to find patterns. Here we find every lowercase ‘o’ in each string; the match is case-sensitive, and rows with no match get an empty list.
df['Matches'] = df['TextData'].str.findall('o')
df
| | TextData | LowerCase | UpperCase | Contains | Matches |
|---|---|---|---|---|---|
| 0 | Hello | hello | HELLO | True | [o] |
| 1 | World | world | WORLD | True | [o] |
| 2 | Python | python | PYTHON | True | [o] |
| 3 | Pandas | pandas | PANDAS | False | [] |
| 4 | Data Science | data science | DATA SCIENCE | False | [] |
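A single literal character does not really exercise the regex engine. As a sketch of what patterns can do, the example below uses a character class with .str.findall(), counts matches with .str.count(), and pulls out a capture group with .str.extract(). The patterns are illustrative choices, not part of the original notebook.

```python
import re
import pandas as pd

text = pd.Series(['Hello', 'World', 'Python', 'Pandas', 'Data Science'])

# A character class matches any vowel; re.IGNORECASE also matches uppercase vowels
vowels = text.str.findall(r'[aeiou]', flags=re.IGNORECASE)

# count() returns the number of matches per string
vowel_counts = text.str.count(r'[aeiou]')

# extract() returns the first capture group; expand=False yields a Series
first_word = text.str.extract(r'^(\w+)', expand=False)

print(vowel_counts.tolist())  # [2, 1, 1, 2, 5]
print(first_word.tolist())    # ['Hello', 'World', 'Python', 'Pandas', 'Data']
```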
Replacement and Splitting
Replace substrings with .str.replace() and split strings with .str.split().
df['Replaced'] = df['TextData'].str.replace('o', 'x')
df
| | TextData | LowerCase | UpperCase | Contains | Matches | Replaced |
|---|---|---|---|---|---|---|
| 0 | Hello | hello | HELLO | True | [o] | Hellx |
| 1 | World | world | WORLD | True | [o] | Wxrld |
| 2 | Python | python | PYTHON | True | [o] | Pythxn |
| 3 | Pandas | pandas | PANDAS | False | [] | Pandas |
| 4 | Data Science | data science | DATA SCIENCE | False | [] | Data Science |
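The default value of the regex parameter of .str.replace() has changed across pandas versions, so it can be safer to state it explicitly. The sketch below contrasts a literal replacement with a regex replacement; the vowel pattern and variable names are assumptions made for illustration.

```python
import pandas as pd

text = pd.Series(['Hello', 'World', 'Python', 'Pandas', 'Data Science'])

# Literal replacement: regex=False makes the intent explicit
literal = text.str.replace('o', 'x', regex=False)

# Regex replacement: replace every vowel with '*', ignoring case
masked = text.str.replace(r'[aeiou]', '*', regex=True, case=False)

print(literal.tolist())  # ['Hellx', 'Wxrld', 'Pythxn', 'Pandas', 'Data Science']
print(masked.tolist())   # ['H*ll*', 'W*rld', 'Pyth*n', 'P*nd*s', 'D*t* Sc**nc*']
```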
df['Split'] = df['TextData'].str.split(' ')
df
| | TextData | LowerCase | UpperCase | Contains | Matches | Replaced | Split |
|---|---|---|---|---|---|---|---|
| 0 | Hello | hello | HELLO | True | [o] | Hellx | [Hello] |
| 1 | World | world | WORLD | True | [o] | Wxrld | [World] |
| 2 | Python | python | PYTHON | True | [o] | Pythxn | [Python] |
| 3 | Pandas | pandas | PANDAS | False | [] | Pandas | [Pandas] |
| 4 | Data Science | data science | DATA SCIENCE | False | [] | Data Science | [Data, Science] |
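A column of lists is often awkward to work with. One possible follow-up, sketched here on the same sample words, is to pass expand=True so each piece lands in its own column, or to index into the lists with .str[0].

```python
import pandas as pd

text = pd.Series(['Hello', 'World', 'Python', 'Pandas', 'Data Science'])

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values
parts = text.str.split(' ', expand=True)

# .str[0] picks the first element of each list produced by split()
first = text.str.split(' ').str[0]

print(parts.shape)     # (5, 2)
print(first.tolist())  # ['Hello', 'World', 'Python', 'Pandas', 'Data']
```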
Best Practices
- Handle missing values: .str methods propagate NaN instead of raising an error, so missing entries stay missing in the result.
- For complex regex, test patterns separately before applying them to a whole column.
- Vectorized operations are faster than loops.
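The first two points can be sketched together as follows. The pattern and the sample Series are hypothetical, chosen only to illustrate the workflow: check the regex on plain Python strings with the re module, then apply it to a Series that contains a missing value.

```python
import re
import pandas as pd

# Hypothetical pattern: a capitalized word at the start of the string
pattern = re.compile(r'^[A-Z][a-z]+')

# Check the pattern on plain Python strings before touching the data
assert pattern.match('Hello') is not None
assert pattern.match('hello') is None

text = pd.Series(['Hello', None, 'data science'])

# The missing entry passes through .str.upper() as missing instead of raising
upper = text.str.upper()

# Apply the tested pattern to the whole Series; na=False keeps the mask boolean
starts_capitalized = text.str.match(pattern.pattern, na=False)

print(upper.isna().tolist())        # [False, True, False]
print(starts_capitalized.tolist())  # [True, False, False]
```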
Summary
This notebook covered essential pandas string methods for text manipulation. Experiment with real datasets to master these techniques!